应用层协议原理 principles of network applications

网络应用体系结构(application architecture):

客户-服务器体系结构(client-server architecture)

server:

always-on host
permanent IP address
data centers for scaling

clients:

communicate with server
may be intermittently connected
may have dynamic IP addresses
do not communicate directly with each other

P2P体系结构(P2P architecture): peer-to-peer

no always-on server
arbitrary end systems directly communicate
peers request service from other peers, provide service in return to other peers:

self scalability – new peers bring new service capacity, as well as new service demands

自扩展性(self-scalability)

peers are intermittently connected and change IP addresses:
complex management

进程(process)通信

process: program running within a host

within same host, two processes communicate using inter-process communication(defined by OS)
processes in different hosts communicate by exchanging messages

客户和服务器进程:
在一对进程之间的通信会话场景中，发起通信(即在该会话中开始时发起与其他进程的联系)的进程被标识为客户，在会话开始时等待联系的进程是服务器

client process: process that initiates communication

server process: process that waits to be contacted

aside: applications with P2P architectures have client processes & server processes

进程与计算机网络之间的接口：应用程序编程接口(Application Programming Interface, API)

套接字(socket)

process sends/receives messages to/from its socket
socket analogous to door:

sending process shoves message out door
sending process relies on transport infrastructure on other side of door to deliver message to socket at receiving process

应用程序开发者对运输层的控制仅限于

选择运输层协议
设定运输层参数如最大缓存和最大报文段长度等

进程寻址：

主机地址：IP地址(IP address)
目的主机宏接受进程的标识符：端口号(port number)

to receive messages, process must have identifier
host device has unique 32-bit IP address
Q: does IP address of host on which process runs suffice for identifying the process?
A: no, many processes can be running on same host
identifier includes both IP address and port numbers associated with process on host.

example port numbers:

HTTP server: 80
mail server: 25

应用程序可用运输服务

可靠数据传输(reliable data transfer)

容忍丢失的应用(loss-tolerant application)

吞吐量

带宽敏感的应用(bandwidth-sensitive application)
弹性应用(elastic application)

定时
安全性
data integrity

some apps (e.g., file transfer, web transactions) require 100% reliable data transfer
other apps (e.g., audio) can tolerate some loss

throughput

some apps (e.g., multimedia) require minimum amount of throughput to be “effective”
other apps (“elastic apps”) make use of whatever throughput they get

timing

some apps (e.g., Internet telephony, interactive games) require low delay to be “effective”

security

encryption, data integrity

因特网提供的运输服务

QoS: Quantity of Service

TCP服务：面向连接服务和可靠数据传输

TCP service:

reliable transport between sending and receiving process
flow control: sender won’t overwhelm receiver
congestion control: throttle sender when network overloaded
does not provide: timing, minimum throughput guarantee, security
connection-oriented: setup required between client and server processes

UDP服务：不提供不必要服务的轻量级运输协议，无连接的

UDP service:

unreliable data transfer between sending and receiving process
does not provide: reliability, flow control, congestion control, timing, throughput guarantee, security, or connection setup,

安全套接字层(Secure Sockets Layer, SSL): TCP的加强版本，增加安全性服务

TCP & UDP:

no encryption
cleartext passwds sent into socket traverse Internet in cleartext

SSL:

provides encrypted TCP connection
data integrity
end-point authentication

SSL is at app layer

apps use SSL libraries, that “talk” to TCP

SSL socket API

cleartext passwords sent into socket traverse Internet encrypted

应用层协议(application-layer protocol)

交换的报文类型
报文类型的语法
字段的语义
进程何时及如何发送报文，对报文响应的规则
types of messages exchanged,

e.g., request, response

message syntax:

what fields in messages & how fields are delineated

message semantics

meaning of information in fields

exchange rules for when and how processes send & respond to messages

open protocols:

defined in RFCs
allows for interoperability
e.g., HTTP, SMTP

proprietary protocols:

e.g., Skype

Telnet

TELNET Protocol Offers three basic services:

Defines network virtual terminal (NVT)

Basic terminal operations
standard interface

Allows client and server negotiate options
Treats both ends symmetrically

No force on client; it can be arbitrary program

Network Virtual Terminal:

Different OS’s, different keys : Ctrl-c <–> ESC
different End of line : CR, LF, CR-LF

TELNET defines standard

7 bit US ASCII
their interpretation of control(e.g., CR-13,LF-10)
End of line is CR-LF
Some 8 bit octets for control functions

rlogin – understands Unix environment better

Passes some terminal parameters automatically
Trusted machines do not require password
The Rlogin process uses the TCP port 513

文件传输协议: FTP

FTP使用两个TCP连接传输文件

控制连接(control connection)

两主机之间传递控制信息：用户标识、口令、改变远程目录和’put’/‘get’等

带外传送(out-of-band)

数据连接(data connection)

FTP命令和回答：7比特ASCII格式在控制连接上传送，每个命令跟回车换行符(区分命令)

每个命令4个大写字母ASCII字符，部分具有可选参数 e.g. USER username、PASS password、LIST、RETR filename、STOR filename 等

回答是一个3位数字，后跟可选信息
e.g. 230 User login, OK、331 Uername OK, Password require、125 Data connection already open; transfer starting、425 Can’t open data connection、452 Error writing file 等

TFTP(Trivial File Transfer Protocol)

TFTP is intended for application with simple needs:

Simple file transfer w/o authentication
Software is much smaller than FTP
Small size of software is important
e.g., in ROM for booting diskless workstations

Uses UDP instead of TCP

Less reliable
Using timeout and retransmission

TFTP Reliability:

Minimal error handling
Most errors result in termination

Web和HTTP

web page consists of objects
object can be HTML file, JPEG image, Java applet, audio file,…
web page consists of base HTML-file which includes several referenced objects
each object is addressable by a URL

web的应用层协议：超文本传输协议(HyperText Transfer Protocol, HTTP)

HTTP: hypertext transfer protocol – Web’s application layer protocol

client/server model

client: browser that requests, receives, (using HTTP protocol) and “displays” Web objects
server: Web server sends (using HTTP protocol) objects in response to requests

uses TCP:

client initiates TCP connection (creates socket) to server, port 80
server accepts TCP connection from client
HTTP messages (application-layer protocol messages) exchanged between browser (HTTP client) and Web server (HTTP server)
TCP connection closed

无状态协议(stateless protocol)

HTTP is “stateless”:
server maintains no information about past client requests

protocols that maintain “state” are complex!

past history (state) must be maintained
if server/client crashes, their views of “state” may be inconsistent, must be reconciled

非持续连接(non-persistent connection): 每一个请求/响应对是经一个单独的TCP连接发送，如HTTP1.0

non-persistent HTTP:

at most one object sent over TCP connection, connection then closed
downloading multiple objects required multiple connections

suppose user enters URL: www.someSchool.edu/someDepartment/home.index (contains text, references to 10 jpeg images)

1a. HTTP client initiates TCP connection to HTTP server (process) at www.someSchool.edu on port 80

1b. HTTP server at host www.someSchool.edu waiting for TCP connection at port 80. “accepts” connection, notifying client

HTTP client sends HTTP request message (containing URL) into TCP connection socket. Message indicates that client wants object someDepartment/home.index
HTTP server receives request message, forms response message containing requested object, and sends message into its socket
HTTP server closes TCP connection.
HTTP client receives response message containing html file, displays html. Parsing html file, finds 10 referenced jpeg objects
Steps 1-5 repeated for each of 10 jpeg objects

持续连接(persistent connection): 所有的请求及其响应经相同的TCP连接发送，如HTTP1.1

persistent HTTP:

multiple objects can be sent over single TCP connection between client, server

response time

往返时间(Round-Trip Time, RTT): 一个短分组从客户到服务器然后在返回客户的时间

RTT (definition): time for a small packet to travel from client to server and back

HTTP response time:

one RTT to initiate TCP connection
one RTT for HTTP request and first few bytes of HTTP response to return
file transmission time
non-persistent HTTP response time = 2RTT+ file transmission time

non-persistent HTTP issues:

requires 2 RTTs per object
OS overhead for each TCP connection
browsers often open parallel TCP connections to fetch referenced objects

persistent HTTP:

server leaves connection open after sending response
subsequent HTTP messages between same client/server sent over open connection
client sends requests as soon as it encounters a referenced object
as little as one RTT for all the referenced objects

HTTP报文格式

two types of HTTP messages: request, response

HTTP request message:

ASCII (human-readable format)

请求报文

请求行(request line): 方法字段、URL字段、HTTP版本字段
首部行(header line)(多个)
实体体(entity body)(POST方法才使用实体体)

POST method:

web page often includes form input
input is uploaded to server in entity body

URL method:

uses GET method
input is uploaded in URL field of request line: www.somesite.com/animalsearch?monkeys&banana

HTTP/1.0: GET / POST / HEAD(asks server to leave requested object out of response)

HTTP/1.1:

GET, POST, HEAD
PUT(uploads file in entity body to path specified in URL field)
DELETE(deletes file specified in the URL field)

响应报文

初始状态行(status line): 协议版本字段、状态码、相应状态信息
首部行(多个)
实体体(entity body)

status code appears in 1st line in server-to-client response message.

some sample codes:

200 OK(request succeeded, requested object later in this msg)
301 Moved Permanently(requested object moved, new location specified later in this msg (Location:))
400 Bad Request(request msg not understood by server)
404 Not Found(requested document not found on this server)
505 HTTP Version Not Supported

Trying out HTTP (client side) for yourself:

cookie技术有四个组件:

在HTTP响应报文中的一个cookie首部行
在HTTP请求报文中的一个cookie首部行
在用户端系统中保留有一个cookie文件，由浏览器进行管理
位于Web站点的一个后端数据库

many Web sites use cookies
four components:

1) cookie header line of HTTP response message
2) cookie header line in next HTTP request message
3) cookie file kept on user’s host, managed by user’s browser
4) back-end database at Web site

example:

Susan always access Internet from PC
visits specific e-commerce site for first time
when initial HTTP requests arrives at site, site creates:
- unique ID
- entry in backend database for ID

what cookies can be used for:

authorization
shopping carts
recommendations
user session state (Web e-mail)

how to keep “state”:

protocol endpoints: maintain state at sender/receiver over multiple transactions
cookies: http messages carry state

cookies and privacy:

cookies permit sites to learn a lot about you
you may supply name and e-mail to sites

Web缓存

Web缓存器(web cache)/代理服务器(proxy server): 代表初始Web服务器满足HTTP请求的网络实体

Web缓存器可以大大减少对客户请求的响应时间，特别是当客户与初始服务器之间的瓶颈带宽远低于客户与Web缓存器之间的瓶颈带宽时。

Web缓存器能够大大减少一个机构的接入链路到因特网的通信量。

goal: satisfy client request without involving origin server

user sets browser: Web accesses via cache

browser sends all HTTP requests to cache

object in cache: cache returns object
else cache requests object from origin server, then returns object to client

cache acts as both client and server

server for original requesting client
client to origin server

typically cache is installed by ISP (university, company, residential ISP)

why Web caching:

reduce response time for client request
reduce traffic on an institution’s access link
Internet dense with caches: enables “poor” content providers to effectively deliver content (so too does P2P file sharing)

example:

assumptions:

avg object size: 100K bits
avg request rate from browsers to origin servers:15/sec
avg data rate to browsers: 1.50 Mbps
RTT from institutional router to any origin server: 2 sec
access link rate: 1.54 Mbps

consequences:

LAN utilization: 0.15%
access link utilization = 99%
total delay = Internet delay + access delay + LAN delay
= 2 sec + minutes + usecs

example: fatter access link

raise access link rate to 154 Mbps
access link utilization will change to 0.99%
total delay will be 2sec and msecs and usecs

example: install local cache

Calculating access link utilization, delay with cache:

suppose cache hit rate is 0.4; 40% requests satisfied at cache, 60% requests satisfied at origin
access link utilization: 60% of requests use access link
data rate to browsers over access link = 0.6*1.50 Mbps = .9 Mbps; utilization = 0.9/1.54 = .58
total delay = 0.6 (delay from origin servers) + 0.4 (delay when satisfied at cache) = 0.6 (2.01) + 0.4 (~msecs) = ~ 1.2 secs less than with 154 Mbps link (and cheaper too!)

内容分发网络(Content Distribution Network, CDN): 地理上分散的Web缓存器

条件GET方法

conditional GET方法: HTTP协议允许缓存器证实它的对象是最新的的一种机制

请求报文使用GET方法，并且请求报文中包含一个”If-Modified-Since:”首部行

Goal: don’t send object if cache has up-to-date cached version

no object transmission delay
lower link utilization

cache: specify date of cached copy in HTTP request

If-modified-since:

server: response contains no object if cached copy is up-to-date:

HTTP/1.0 304 Not Modified

因特网中的电子邮件(Electronic mail)

3个主要组成部分: 用户代理(user agent)、邮件服务器(mail server)、简单邮件传输协议(Simple Mail Transfer Protocol, SMTP)

邮箱(mailbox)

User Agent

a.k.a. “mail reader”
composing, editing, reading mail messages; e.g., Outlook, Thunderbird, iPhone mail client
outgoing, incoming messages stored on server

mail servers:

mailbox contains incoming messages for user
message queue of outgoing (to be sent) mail messages
SMTP protocol between mail servers to send email messages
1. client: sending mail server
2. “server”: receiving mail server

SMTP

限制所有邮件报文的体部分职能采用7比特ASCII码表示

HTTP是一个拉协议(pull protocol): 某些人在Web服务器上装载信息，用户使用HTTP从该服务器拉取这些信息

SMTP是一个推协议(push protocol): 发送邮件服务器把文件推向接受邮件服务器

SMTP [RFC 2821]:

uses TCP to reliably transfer email message from client to server, port 25
direct transfer: sending server to receiving server
three phases of transfer

handshaking (greeting)
transfer of messages
closure

command/response interaction (like HTTP)

commands: ASCII text
response: status code and phrase

messages must be in 7-bit ASCII

SMTP uses persistent connections
SMTP requires message (header & body) to be in 7-bit ASCII
SMTP server uses CRLF.CRLF to determine end of message

comparison with HTTP:

HTTP: pull vs SMTP: push
both have ASCII command/response interaction, status codes
HTTP: each object encapsulated in its own response message vs SMTP: multiple objects sent in multipart message

邮件报文格式(Mail message format)和MIME

From: 首部行、To: 首部行、Subject: 首部行

SMTP: protocol for exchanging email messages

RFC 822: standard for text message format:

header lines, e.g.,

To:

From:

Subject:

different from SMTP MAIL FROM, RCPT TO: commands!
Body: the “message”

ASCII characters only

邮件访问协议

不能使用SMTP取回报文，取报文是一个拉操作而SMTP协议是一个推协议

SMTP: delivery/storage to receiver’s server

mail access protocol: retrieval from server

POP: Post Office Protocol [RFC 1939]: authorization, download
IMAP: Internet Mail Access Protocol [RFC 1730]: more features, including manipulation of stored messages on server
HTTP: gmail, Hotmail, Yahoo! Mail, etc.

第三版邮局协议(Post Office Protocol-Version 3, POP3)

三个阶段:

特许(authorization): 用户代理以明文形式发送用户名和口令以鉴别用户
事务处理: 用户代理取回报文，对报文做删除标记，取消报文删除标记，获取邮件的统计信息
更新: 客户发出quit命令退出POP3会话；邮件服务器删除被标记为删除的报文

邮件服务器回答: +OK正常 -ERR错误

authorization phase
client commands:

user: declare username
pass: password
server responses
+OK
-ERR
transaction phase, client:
list: list message numbers
retr: retrieve message by number
dele: delete
quit

more about POP3:

POP3 “download-and-keep”: copies of messages on different clients
POP3 is stateless across sessions

因特网邮件访问协议(Internet Mail Access Protocol, IMAP)

keeps all messages in one place: at server
allows user to organize messages in folders
keeps user state across sessions: names of folders and mappings between message IDs and folder name

DNS: 因特网的目录服务

主机使用IP地址(IP address)进行标识

域名系统(Domain Name System, DNS)将主机名解析为IP地址

Internet hosts, routers:

IP address (32 bit) - used for addressing datagrams
“name”, e.g., www.yahoo.com - used by humans

Q: how to map between IP address and name, and vice versa ?

Domain Name System:

distributed database implemented in hierarchy of many name servers

application-layer protocol: hosts, name servers communicate to resolve names (address/name translation)

note: core Internet function, implemented as application-layer protocol

complexity at network’s “edge

DNS是一个由分层的DNS服务器(DNS Server)实现的分布式数据库
DNS是一个使得主机能够查询分布式数据库应用层协议，运行在UDP之上，使用53号端口

主机别名(host aliasing): 通常规范主机名(canonical hostname)较为复杂
邮件服务器别名(mail server aliasing)
负载分配(load distribution): 冗余的服务器(e.g. Web服务器)

DNS services:

hostname to IP address translation
host aliasing(canonical, alias names)
mail server aliasing
load distribution(replicated Web servers: many IP addresses correspond to one name)

单一DNS服务器(集中式设计)的问题(why not centralize DNS?):

单点故障(a single point of failure): DNS服务器崩溃会导致整个互联网瘫痪
通信容量(traffic volume): DNS查询体量极其大
远距离集中式数据库(distant centralized database)
维护(maintenance)

分布式设计方案(分布式层次数据库)

根DNS服务器(13个)

顶级域DNS服务器(一般域名国家顶级域名反向域名)

权威DNS服务器

top-level domain (TLD) servers:

responsible for com, org, net, edu, aero, jobs, museums, and all top-level country domains, e.g.: uk, fr, ca, jp
Network Solutions maintains servers for .com TLD
Educause for .edu TLD

Top-Level Domain(TLD):

Generic domains:
.COM – commercial organizations
.EDU – educational institutions
.GOV – federal government institutions
.MIL – United States military
.NET – major network support centers
.ORG – other organizations
.INT – international organizations

Country Domains:

Inverse domain:

authoritative DNS servers:

organization’s own DNS server(s), providing authoritative hostname to IP mappings for organization’s named hosts
can be maintained by organization or service provider

Zones and domains:

Zone

A primary server loads all information from the disk file;
the secondary server loads all information from the primary server.
When the primary downloads information from the secondary, it is called zone transfer.

Authoritative Server Configuration(example):

本地DNS服务器(local DNS server): 严格上不属于DNS层次结构，但很重要

does not strictly belong to hierarchy
each ISP (residential ISP, company, university) has one(also called “default name server”)
when host makes DNS query, query is sent to its local DNS server

has local cache of recent name-to-address translation pairs (but may be out of date!)
acts as proxy, forwards query into hierarchy

Name Resolution:

Resolvers use UDP (single name) or TCP (whole group of names)
Knowing the address of the root server is sufficient
Resolvers use recursive query.
Servers use iterative query.

e.g.: client wants IP for www.amazon.com; 1st approximation:

client queries root server to find com DNS server
client queries .com DNS server to get amazon.com DNS server
client queries amazon.com DNS server to get IP address for www.amazon.com

contacted by local name server that can not resolve name

root name server:

contacts authoritative name server if name mapping not known
gets mapping
returns mapping to local name server

迭代查询(iterative query): 查询结果返回给请求者

contacted server replies with name of server to contact
“I don’t know this name, but ask this server”

host at cis.poly.edu wants IP address for gaia.cs.umass.edu:

递归查询(recusive query): 转发查询请求

puts burden of name resolution on contacted name server
heavy load at upper levels of hierarchy?

DNS Optimization:

Spatial Locality: Local computers referenced more often than remote
Temporal Locality: Same set of domains referenced repeatedly -> Caching
Caching: each entry has a time to live (TTL)
Replication: Multiple servers. Multiple roots. Ask the geographically closest server.

DNS缓存(DNS caching):

改善时延性能并减少在因特网上到处传输的DNS报文数量

once (any) name server learns mapping, it caches mapping

cache entries timeout (disappear) after some time (TTL)
TLD servers typically cached in local name servers, thus root name servers not often visited

cached entries may be out-of-date (best effort name-to-address translation!)

if name host changes IP address, may not be known Internet-wide until all TTLs expire

update/notify mechanisms proposed IETF standard, RFC 2136

DNS: distributed database storing resource records (RR, 提供主机名到IP地址的映射)

RR format: (name, type, value, ttl)

type=A

name is hostname
value is IP address

type=NS

name is domain (e.g., foo.com)
value is hostname of authoritative name server for this domain

type=CNAME

name is alias name for some “canonical” (the real) name
www.ibm.com is really servereast.backup2.ibm.com
value is canonical name

type=MX

value is name of mailserver associated with name

Inserting records into DNS:

example: new startup “Network Utopia”
register name networkutopia.com at DNS registrar (e.g., Network Solutions)

provide names, IP addresses of authoritative name server (primary and secondary)
register inserts two RRs into .com TLD server: (networkutopia.com, NS, dns1.networkutopia.com) (dns1.networkutopia.com, A, 212.212.212.1)

create authoritative server type A record for www.networkuptopia.com; type MX record for networkutopia.com

DNS报文: DNS查询和回答报文(query and reply both with same message format
)

message header:

dentification: 16 bit # for query, reply to query uses same #

flags:

query or reply
recursion desired
recursion available
reply is authoritative

Domain names stored as series of labels

1 octet length (n) (top two bits of length are 00)
n octet label
Zero length marks end of name

Attacking DNS

DDoS attacks

bombard root servers with traffic

not successful to date
traffic filtering
local DNS servers cache IPs of TLD servers, allowing root server bypass

bombard TLD servers

potentially more dangerous

redirect attacks

man-in-middle

Intercept queries

DNS poisoning

Send bogus relies to DNS server, which caches

exploit DNS for DDoS

send queries with spoofed source address: target IP

requires amplification

P2P应用

Objectives of P2P:

Share the resources (storage and bandwidth) of individual clients
to improve scalability/ robustness
Bypass DNS to find clients with resources! examples: instant messaging, skype

Pure P2P architecture

no always-on server
arbitrary end systems directly communicate
peers are intermittently connected and change IP addresses
40-70% of total traffic in many networks

P2P Examples:

File Sharing: BitTorrent, LimeWire
Streaming: PPLive, PPStream, Zatto, …
Research systems
Collaborative computing:
SETI@Home project
Human genome mapping
Intel NetBatch: 10,000 computers in 25 worldwide sites for simulations

File distribution: client-server vs P2P:
how much time to distribute file (size F) from one server to N peers

peer upload/download capacity is limited resource

client-server:

server transmission: must sequentially send (upload) N file copies:

time to send one copy: $F/u_s$
time to send N copies: $NF/u_s$

client: each client must download file copy

$d_{min}$ = min client download rate
min client download time: $F/d_{min}$

time to distribute F to N clients using client-server approach(increases linearly in N):
$D_{c-s}$ > max{$NF/us, F/d{min}$}

P2P:

server transmission: must upload at least one copy

time to send one copy: $F/u_s$

client: each client must download file copy

min client download time: $F/d_{min}$

time to distribute F to N clients using P2P approach(increases linearly in N, but so does this, as each peer brings service capacity):
$D_{P2P}$ > max{$F/us, F/d{min}, NF/(u_s + \sum u_i)$}

Napster(Centralized Database)

Program for sharing music over the Internet

Application-level, client-server protocol over TCP

A centralized index system that maps files (songs) to machines that are alive and with files

Steps:

Connect to Napster server
Upload your list of files (push) to server
Give server keywords to search the full list
Select “best” of hosts with answers

Napster Architecture

Napster Publish

Napster Search

Summary of features: a hybrid design

control: client-server (aka special DNS) for files
data: peer to peer

Advantages: simplicity, easy to implement sophisticated search engines on top of the index system

Disadvantages:

application specific (compared with DNS)
lack of robustness, scalability: central search server single point of bottleneck/failure
easy to sue !

BitTorrent

A P2P file sharing protocol
Created by Bram Cohen in 2004
A peer can download pieces concurrently from multiple locations
A global central index server is replaced by one tracker per file (called a swarm), reduces centralization; but needs other means to locate trackers
The bandwidth scalability management technique is more interesting

torrent(洪流): 参与一个特定文件分发的所有对等方(peer)的集合

追踪器(tracer)

Metadata File Structure: Meta info contains information necessary to contact the tracker and describes the files in the torrent

announce URL of tracker
file name
file length
piece length (typically 256KB)
SHA-1 hashes of pieces for verification
also creation date, comment, creator, …

Tracker Protocol:

Communicates with clients via HTTP/HTTPS

Client GET request

info_hash: uniquely identifies the file
peer_id: chosen by and uniquely identifies the client
client IP and port
numwant: how many peers to return (defaults to 50)
stats: e.g., bytes uploaded, downloaded

Tracker GET response

interval: how often to contact the tracker
list of peers, containing peer id, IP and port
stats

Robustness: A swarming protocol

Peers exchange info about other peers in the system
Peers exchange piece availability and request blocks from peers

块(chunk)

Peer Protocol:

Over TCP

Unchoke: indicate if A allows B to download
Interest/request: indicate which block to send from B to A

incentive

Periodically (typically every 10 seconds) calculate data-receiving rates from all peers
Upload to (unchoke) the fastest
- constant number (4) of unchoking slots
- partition upload bw equally among unchoked
commonly referred to as “tit-for-tat” strategy(一报还一报)

Optimistic Unchoking

Periodically select a peer at random and upload to it
- typically every 3 unchoking rounds (30 seconds)
Multi-purpose mechanism
- allow bootstrapping of new clients
- continuously look for the fastest peers (exploitation vs exploration)

Block Availability

Request (local) rarest first(最稀缺优先)
- achieves the fastest replication of rare pieces
- obtain something of value

Revisions

When downloading starts (first 4 pieces): choose at random and request them from the peers
- get pieces as quickly as possible
- obtain something to offer to others
Endgame mode
- defense against the “last-block problem”: cannot finish because missing a few last pieces
- send requests for missing sub-pieces to all peers in our peer list
- send cancel messages upon receipt of a sub-piece

Summary

Very widely used
- mainline: written in Python
- Azureus and μTorrent: the most popular
- Other popular clients: ABC, BitComet, BitLord, BitTornado, Opera browser
Many explorations, e.g.,
- BitThief
- BitTyrant
Better understanding is needed

Gnutella

文件共享网络

Decentralized Flooding: Gnutella

On startup, client contacts other servents (server + client) in network
to form interconnection/peering relationships
servent interconnection used to forward control (queries, hits, etc)

How to find a resource record: decentralized flooding

send requests to neighbors
neighbors recursively forward the requests

Each node forwards the query to its neighbors other than the onewho forwards it the query

Each node should keep track of forwarded queries to avoid loop:

nodes keep state (which will time out—soft state)

carry the state in the query, i.e. carry a list of visited nodes

Basic message header: Unique ID, TTL, Hops

Message types:

Ping : probes network for other servents
Pong : response to ping, contains IP addr, # of files, etc.
Query : search criteria + speed requirement of servent
QueryHit : successful response to Query, contains addr + port to transfer from, speed of servent, etc.
Ping, Queries are flooded
QueryHit, Pong: reverse path of previous message

Advantages: totally decentralized, highly robust

Disadvantages:

not scalable; the entire network can be swamped with flood requests: especially hard on slow clients; at some point broadcast traffic on Gnutella exceeded 56 kbps
to alleviate this problem, each request has a TTL to limit the scope
each query has an initial TTL, and each node forwarding it reduces it by one; if TTL reaches 0, the query is dropped (consequence?)

Flooding: FastTrack (aka Kazaa)

Modifies the Gnutella protocol into two-level hierarchy

Supernodes
- Nodes that have better connection to Internet
- Act as temporary indexing servers for other nodes
- Help improve the stability of the network
  Standard nodes
Connect to supernodes and report list of files
Search
Broadcast (Gnutella-style) search across supernodes
Disadvantages
Kept a centralized registration -> prone to law suits

分布式散列表(Distributed Hash Table, DHT)

DHT Overview

Abstraction: a distributed “hash-table” (DHT) data structure

put(key, value) and get(key) -> value
DHT imposes no structure/meaning on keys
one can build complex data structures using DHT

Implementation:

nodes in system form an interconnection network: ring, zone, tree, hypercube, butterfly network, …

DHT Applications

File sharing and backup [CFS, Ivy, OceanStore, PAST, Pastiche …]
Web cache and replica [Squirrel, Croquet Media Player]
Censor-resistant stores [Eternity]
DB query and indexing [PIER, Place Lab, VPN Index]
Event notification [Scribe]
Naming systems [ChordDNS, Twine, INS, HIP]
Communication primitives [I3, …]
Host mobility [DTN Tetherless Architecture]

Key Issues in Understanding a DHT Design:

How does the design map keys to internal representation (typically a metric space)?
Which space is a node responsible?
How are the nodes linked?

e.g. …

Content Addressable Network(CAN)

Abstraction

map a key to a “point” in a multi-dimensional Cartesian space
a node “owns” a zone in the overall space
route from one “point” to another

CAN Example: Two Dimensional Space

Space divided among nodes
Each node covers either a square or a rectangular area of ratios 1:2 or 2:1

CAN Insert Example:
node I::insert(K,V)

(1) a = hx(K)
b = hy(K)

(2) route(K,V) -> (a,b)

Routing:

A node maintains state only for its immediate neighboring nodes
Forward to neighbor which is closest to the target point
- a type of greedy, local routing scheme

(3) (K,V) is stored at (a,b)

CAN Retrieve Example:
node J::retrieve(K)

(1) a = hx(K)
b = hy(K)

(2) route “retrieve(K)” to (a,b)

CAN Insert: Join

CAN Evaluations

Guarantee to find an item if in the network
Load balancing
- hashing achieves some load balancing
- overloaded node replicates popular entries at neighbors
Scalability
- for a uniform (regularly) partitioned space with n nodes and d dimensions
- storage:
- - per node, number of neighbors is 2d
- - a fixed d can scale the network without increasing per-node state
routing
- average routing path is $(dn^{1/d})/3$ hops (due to Manhattan distance routing, expected hops in each dimension is dimension length * 1/3)

Chord: search by routing/consistent hashing

Space is a ring
Consistent hashing: m bit identifier space for both keys and nodes
- key identifier = SHA-1(key), where SHA-1() is a popular hash function:
- - Key=“Matrix3” -> ID=60
- node identifier = SHA-1(IP address)
- - IP=“198.10.10.1” -> ID=123

Chord: Storage using a Ring

A key is stored at its successor: node with next higher or equal ID

how to Search(One Extreme)

Every node knows of every other node
- Routing tables are large O(N)
- Lookups are fast O(1)

how to Search(the Other Extreme)

Every node knows its successor in the ring
- Routing tables are small O(1)
- Lookups are slow O(N)

Chord Solution: “finger tables”

Node K knows the node that is maintaining K + 2i,
- increase distance exponentially

Joining the Ring

use a contact node to obtain info
transfer keys from successor node to new node
updating fingers of existing nodes

DHT: Chord Node Join

Assume an identifier space [0..8]
Node n1 joins
Node n2 joins
Node n6 joins
Node n0 joined

DHT: Chord Insert Items

Nodes: n1, n2, n0, n6
Items: f7, f1

Upon receiving a query for item id, a node:

checks whether stores the item locally
if not, forwards the query to the largest node in its successor table that does not exceed id

Chord/CAN Summary

Each node “owns” some portion of the key-space
- in CAN, it is a multi-dimensional “zone”
- in Chord, it is the key-id-space between two nodes in 1-D ring
Files and nodes are assigned random locations in key-space
- provides some load balancing
- - probabilistically equal division of keys to nodes
Routing/search is local (distributed) and greedy
- node X does not know of a path to a key Z
- but if it appears that node Y is the closest to Z among all of the nodes known to X
- so route to Y

video streaming and content distribution networks (CDNs)(流式视频和内容分发网络)

Multimedia: video

video: sequence of images displayed at constant rate
- e.g., 24 images/sec
digital image: array of pixels
- each pixel represented by bits
coding: use redundancy within and between images to decrease # bits used to encode image
- spatial (within image)
- - spatial coding example: instead of sending N values of same color, send only two values: color value and number of repeated values (N)
- temporal (from one image to next)
- - temporal coding example: instead of sending complete frame at i+1, send only differences from frame i
CBR: (constant bit rate): video encoding rate fixed
VBR: (variable bit rate): video encoding rate changes as amount of spatial, temporal coding changes
examples:

MPEG 1 (CD-ROM) 1.5 Mbps
MPEG2 (DVD) 3-6 Mbps
MPEG4 (often used in Internet, < 1 Mbps)

Streaming stored video(流式存储视频)

simple scenario: video server(stored video) -> Internet -> client
challenge: heterogeneity
- different users have different capabilities (e.g., wired versus mobile; bandwidth rich versus bandwidth poor)

Streaming multimedia: DASH(Dynamic Adaptive Streaming over HTTP)

server:

divides video file into multiple chunks
each chunk stored, encoded at different rates
manifest file(文件清单): provides URLs for different chunks

client:

periodically measures server-to-client bandwidth
consulting manifest, requests one chunk at a time
chooses maximum coding rate sustainable given current bandwidth
can choose different coding rates at different points in time (depending on available bandwidth at time)

“intelligence” at client: client determines

when to request chunk (so that buffer starvation, or overflow does not occur)
what encoding rate to request (higher quality when more bandwidth available)
where to request chunk (can request from URL server that is “close” to client or has high available bandwidth)

Content distribution networks(CDN, 内容分发网络)

video traffic: major consumer of Internet bandwidth

challenge: how to stream content (selected from millions of videos) to hundreds of thousands of simultaneous users?

option 1: single, large “mega-server”

single point of failure
point of network congestion
long path to distant clients
multiple copies of video sent over outgoing link

quite simply: this solution doesn’t scale

option 2: store/serve multiple copies of videos at multiple geographically distributed sites (CDN)

enter deep: push CDN servers deep into many access networks
- close to users
- used by Akamai, 1700 locations
bring home: smaller number (10’s) of larger clusters in POPs near (but not within) access networks
- used by Limelight

CDN: stores copies of content at CDN nodes

e.g. Netflix stores copies of MadMen

subscriber requests content from CDN

directed to nearby copy, retrieves content
may choose different copy if network path congested

OTT(over the top) challenges: coping with a congested Internet

from which CDN node to retrieve content?
viewer behavior in presence of congestion?
what content to place in which CDN node?

Caching

explicit(明确的)
transparent (hijacking connections)
Replication
server farms(服务器群)
geographically dispersed(分散)(CDN)

Traditional: Performance

move content closer to the clients
avoid server bottlenecks
New: DDoS Protection
dissipate attack over massive resources
multiplicatively raise level of resources needed to attack

Denial of Service Attacks (DoS)

Distributed DoS (DDoS)

Redirection Overlay

Techniques
DNS

one name maps onto many addresses
works for both servers and reverse proxies
HTTP
requires an extra round trip
Router
one address, select a server (reverse proxy)
content-based routing (near client)
URL Rewriting
embedded links

Ridirection

Hashing Schemes: Modulo

Easy to compute
Evenly distributed
Good for fixed number of servers
Many mapping changes after a single server change

Consistent Hashing (CHash)

Hash server, then URL
Closest match
Only local mapping changes after adding or removing servers
Used by State-of-the-art CDNs

Highest Random Weight (HRW)

Hash(url, svrAddr)
Deterministic order of access set of servers
Different order for different URLs
Load evenly distributed after server changes

Redirection Strategies

Random (Rand)
- Requests randomly sent to cooperating servers
- Baseline case, no pathological behavior
Replicated Consistent Hashing (R-CHash)
- Each URL hashed to a fixed # of server replicas
- For each request, randomly select one replica
Replicated Highest Random Weight (R-HRW)
- Similar to R-CHash, but use HRW hashing
- Less likely two URLs have same set of replicas
Coarse(粗略) Dynamic Replication (CDR)
- Using HRW hashing to generate ordered server list
- Walk through server list to find a lightly loaded one
- of replicas for each URL dynamically adjusted
- Coarse grained server load information
Fine Dynamic Replication (FDR)
- Bookkeeping min # of replicas of URL (popularity)
- Let more popular URL use more replicas
- Keep less popular URL from extra replication

Replication

Why Replicate?

Performance
- keep copy close to remote users
- caching is a special case
Survive Failures
- availability: provide service during temporary failure
- fault tolerance: provide service despite catastrophic failure

Fault Models

Crashed
- failed device doesn’t do anything (i.e., fails silently)
Fail-Stop
- failed device tells you that it has failed
Byzantine(拜占庭将军问题,Byzantine Generals Problem)
- failed device can do anything
- adversary
- - playing a game against an evil opponent
- - opponent knows what you’re doing and tries to fool you
- - usually some limit on opponent’s actions (e.g. at most k failures)

拜占庭将军问题 (Byzantine Generals Problem)，是由莱斯利兰伯特提出的点对点通信中的基本问题。在分布式计算上，不同的计算机透过讯息交换，尝试达成共识；但有时候，系统上协调计算机 (Coordinator / Commander) 或成员计算机 (Member / Lieutanent) 可能因系统错误并交换错的讯息，导致影响最终的系统一致性。拜占庭将军问题就根据错误计算机的数量，寻找可能的解决办法 (但无法找到一个绝对的答案，只可以用来验证一个机制的有效程度)。

Synchrony

Assumptions concerning boundedness of component execution or network transmissions
Synchronous
- always performs function in a finite & known time bound
Asynchronous
- no such bound
Famous Result: A group of processes cannot agree on a value in an asynchronous system given a single crash failure

Network Partitions

Can’t tell the difference between a crashed process and a process that’s inaccessible due to a network failure.
Network Partition: network failure that cuts processes into two or more groups
- full communication within each group
- no communication between groups
- danger: each group thinks everyone else is dead

Mirroring

Goal: service up to K failures
Approach: keep K+1 copies of everything
Clients do operations on “primary” copy
Primary makes sure other copies do operations too
Advantage: simple
Disadvantages:
- do every operation K times
- use K times more storage than necessary

Mirroring Details

Optimization: contact one replica to read
What if a replica fails?
- get up-to-date data from primary after recovering
What if primary fails?
- elect a new primary

Election Problem

When algorithm terminates, all non-failed processes agree on which replica is the primary
Algorithm works despite arbitrary failures and recoveries during the election
If there are no more failures and recoveries, the algorithm must eventually terminate

Bully Algorithm

Use fixed “pecking order” among processes
- e.g., use network addresses
Idea: choose the “biggest” non-failed machine as primary
Correctness proof is difficult

Bully Algorithm Details

Process starts an election whenever it recovers or whenever primary has failed
- how to know primary has failed?(一段时间无响应)
To start an election, send election messages to all machines bigger than yourself
- if somebody responds with an ACK, give up
- if nobody ACKs, declare yourself the primary
On receiving election message, reply with ACK and start an election yourself (unless in progress)

Quorums

Quorum(法人): a set of server machines
Define what constitutes a “read quorum” and a “write quorum”
To write
- acquire locks on all members of some write quorum
- do writes on all locked servers
- release locks
To read: similar, but use read quorum
Correctness requirements
- any two write quorums must share a member
- any read quorum and any write quorum must share a member (read quorums need not overlap(覆盖))
Locking ensures that
- at most one write happening at a time
- never have a write and a read happening at the same time

Defining Quorums

Many alternatives
Example
- write quorum must contain all replicas
- read quorum may contain any one replica
Consequence
- writes are slow, reads are fast
- can write only if all replicas are available
- can read if any one replica is available
Example: Majority Quorum
- write quorum: any set with more than half the replicas
- read quorum: any set with more than half the replicas
Consequences
- modest(适度) performance for read and write
- can proceed(进行) as long as more than half the replicas are available

Quorums & Version Numbers

Write operation writes only a subset of the servers
- some servers are out-of-date
Remedy(纠正方法)
- put version number stamp on each item in each replica
- when acquiring locks, get current version number from each replica
- quorum overlap rules ensure that one member of your quorum has the latest version
- When reading, get the data from the latest version number in your quorum
- When writing, set version number of all replicas you wrote equal to 1 + (max version number in your quorum beforehand)
- Guarantees correctness even if no recovery action is taken when replica recovers from a crash

Quorums and Partitions

One group has a write quorum (and thus usually a read quorum);
- that group can do anything
- other groups are frozen
No group has a write quorum, but some groups have a read quorum
- some groups can read
- no groups can write
No group contains any quorum
- everyone is frozen

socket programming with UDP and TCP

What is “socket”

A socket is a virtual connection between two applications
Using a socket, two processes can communicate with each other
The socket is the major communication tool for Internet applications
A socket is bi-directional (full-duplex) transmission
A socket can be created dynamically

Socket as a virtual connection between two processes

Socket as a client/server model

Port: a logical connecting point at the transport-layer protocol.

Socket programming

Two socket types for two transport services:

UDP: unreliable datagram
TCP: reliable, byte stream-oriented

Application Example:

client reads a line of characters (data) from its keyboard and sends data to server
server receives the data and converts characters to uppercase
server sends modified data to client
client receives modified data and displays line on its screen

Socket programming with TCP

client must contact server
- server process must first be running
- server must have created socket (door) that welcomes client’s contact
client contacts server by:
- Creating TCP socket, specifying IP address, port number of server process
- when client creates socket: client TCP establishes connection to server TCP
- when contacted by client, server TCP creates new socket for server process to communicate with that particular client
- - allows server to talk with multiple clients
- - source port numbers used to distinguish clients

application viewpoint:
TCP provides reliable, in-order, byte-stream transfer (“pipe”) between client and server

Client/Server Process Organization

Unix Socket Programming Technical Details

Unix TCP server
(1) create socket:
socket_id = socket (AF_INET, SOCK_STREM, DEFAULT_PROTOCOL);
(2) bind socket:
bind (socket_id, server_addr, server_len);
(3) listen to socket:
listen (socket_id, number_of_connection);
(4) accept a connection:
accept (socket_id, &client_addr, &client_len);
(5) read (receive) data:
read (socket_id, buffer, buffer_len);
(6) write (send) data:
write (socket_id, buffer, buffer_len);
(7) close socket:
close(socket_id);

Unix TCP client
(1) create socket: same as server
socket_id = socket (AF_INET, SOCK_STREM, DEFAULT_PROTOCOL);
(2) connect socket:
connect (socket_id, serverINETaddress, server_len);
(3) write (send) data:
write (socket_id, buffer, buffer_len);
(4) read (receive) data:
read (socket_id, buffer, buffer_len);
(5) close socket: same as server
close(socket_id);

Step 1: socket(…) call

It declares a socket to be used.
Prepare data structure to manage socket
OS is responsible for this

Step 2: bind(…) call

It connects a process to a specific port
Port = A logical connecting point at a host for two communicating processes using socket
Port Numbers:
0~1023: System Reserved
Port 21: FTP
Port 23: telnet
Port 80: HTTP
1024 and above: available to users

Step 3: listen(…) call

listen() system call: prepare memory buffer for incoming connections
listen (socket_id, number_of_connection) -> We need to specify how many connection requests should be held in the buffer when SERVER is busy (can’t accept a request).

Step 4 - Part 1: accept(…) call

The server process accepts a request from a client
accept() function is a blocking function

Step 4 - Part 2: accept(…) call

The accept(_) call returns another port number and establish another connection

Step 5: read( ) and write( ) call

The server and client communicate using the second socket

Step 6: close ( ) call

Close the second socket and leave the first socket for next client

Step 7: Go back to accept(…) call

The server process goes back to the accept call

Winsock Programming Technical Details

Initialize Winsock

Step 1: Define your socket
Step 2: Initialize your socket
Step 3: Start using it

void main (void)
{
     /* The following two lines needed for Window's socket */
     // Winsock version 2.2
     WORD wVersionRequested = MAKEWORD(2,2);  /* Stuff for WSA functions   */
     WSADATA wsaData;                         /* Stuff for WSA functions   */
    /* This stuff initializes winsock*/
    WSAStartup(wVersionRequested, &wsaData);
    /* Create a socket */
    My_SocketID = socket (   …..   );

socket ( ) function : Returns socket ID on success

unsigned int socket_id = socket (AF_INET, SOCK_STREAM, 0);

bind ( ) function : Return code (< 0 if error)

int status = bind (socket_id, (struct sockaddr_in *) my_addr, sizeof(my_addr));

The sockaddr_in structure to specify port # and IP address of this machine (server machine)

The “sock_addr” structure
Step 1: You instantiate the structure
Step 2: Fill up the components

struct sockaddr_in   my_addr; /* My (client) Internet address */
/* Set My(client's) IP Address ---------------------------------------- */
my_addr.sin_family      = AF_INET;         /* Address Family To Be Used */  
my_addr.sin_port          = htons (MY_PORT_NUM);     /* Port number to use */ 
my_addr.sin_addr.s_addr = htonl (INADDR_ANY);      /* My IP address */

listen ( ) function : Return code (< 0 if error)

int status = listen (socket_id, 3);

The size of the connection request buffer(3)

accept ( ) function : duplicated socket ID (< 0 if error)

unsingned int child_sock = accept (socket_id, (struct sockaddr_in *) client_addr, sizeof (client_addr);

recv ( ) function : On success, the number of bytes received; Return code (< 0 if error)

int status = recv (child_sock, in_buffer, MAX_BUFFER_SIZE, 0);

The input (receive) buffer as a character string

Example: char in_buffer [MAX_BUFFER]

send ( ) function : On success, the number of bytes actually sent; Return code (< 0 if error)

int status = send (child_sock, out_buffer, MAX_BUFFER_SIZE, 0);

closesocket ( ) function : Return code (< 0 if error)

int status = closesocket (child_sock);

Clear winsock
After you call “closesocket” function but before your program is terminated

1 2	/* This stuff cleans-up winsock */ WSACleanup( );

How to specify your destination in socket?

Each destination for a socket connection is determined by < IP address + Port# >

sockaddr_in structure is used to define your destination

The sockaddr_in structure is defined in C/C++ struct
The sockaddr_in structure is defined in windows.h header file

struct sockaddr_in {
    u_char  sin_len;                /* Length of this structure */
    u_char  sin_family;           /* Network protocol used*/
    u_short sin_port;               /* Port number */
    struct  in_addr sin_addr;   /* Pointer to an IP address */
    char    sin_zero[8];            /* Extra information */
};

1
2
3

struct in_addr {
        u_long  s_addr;         /* Actual IP address */
};

How can I set the IP address and the port number of my destination?

STEP #1: Instantiate a sockaddr_in structure:
STEP #2: Set your destination IP address:
- Case 1: by “32-bit IP address”: server_address.sin_addr.s_addr = inet_addr(“146.163.147.59”);
- Case 2: by a host name: server_address.sin_addr.s_addr = inet_aton(“cougar.siue.edu”);
- Case 3: by a system-defined parameter: server_address.sin_addr.s_addr = htonl(INADDR_ANY);
STEP #3: Set your destination port number: server_address.sin_port = htons(80);

Python Programming Technical Details

Python TCPClient:

from socket import * #include Python’s socket library
serverName = ’servername’
serverPort = 12000
clientSocket = socket(AF_INET, SOCK_STREAM) create TCP socket for server, remote port 12000
clientSocket.connect((serverName,serverPort))
sentence = raw_input(‘Input lowercase sentence:’)
clientSocket.send(sentence.encode()) # No need to attach server name, port 
modifiedSentence = clientSocket.recv(1024)
print (‘From Server:’, modifiedSentence.decode())
clientSocket.close()

Python TCPServer:

 from socket import *
serverPort = 12000
serverSocket = socket(AF_INET,SOCK_STREAM) # create TCP welcoming socket
serverSocket.bind((‘’,serverPort))
serverSocket.listen(1) # server begins listening for  incoming TCP requests
print ‘The server is ready to receive’
while True: # loop forever
    # server waits on accept() for incoming requests, new socket created on return
    connectionSocket, addr = serverSocket.accept()
    # read bytes from socket (but not address as in UDP)
    sentence = connectionSocket.recv(1024).decode()
    capitalizedSentence = sentence.upper()
    connectionSocket.send(capitalizedSentence.encode())
    
    # close connection to this client (but not welcoming socket)
    connectionSocket.close()

Socket programming with UDP

UDP: no “connection” between client & server
- no handshaking before sending data
- sender explicitly attaches IP destination address and port # to each packet
- receiver extracts sender IP address and port# from received packet
UDP: transmitted data may be lost or received out-of-order

Application viewpoint: UDP provides unreliable transfer of groups of bytes (“datagrams”) between client and server

Client/server socket interaction: UDP

Python UDPClient

from socket import *
serverName = ‘hostname’
serverPort = 12000
# create UDP socket for server
clientSocket = socket(AF_INET, SOCK_DGRAM)
# get user keyboard input
message = raw_input(’Input lowercase sentence:’)
# Attach server name, port to message; send into socket
clientSocket.sendto(message.encode(), (serverName, serverPort))
# read reply characters from socket into string
modifiedMessage, serverAddress = clientSocket.recvfrom(2048)
# print out received string and close socket
print modifiedMessage.decode()
clientSocket.close()

from socket import *
serverPort = 12000
# create UDP socket
serverSocket = socket(AF_INET, SOCK_DGRAM)
# bind socket to local port number 12000
serverSocket.bind(('', serverPort))
print (“The server is ready to receive”)
while True: # loop forever
    # Read from UDP socket into message, getting client’s address (client IP and port)
    message, clientAddress = serverSocket.recvfrom(2048)
    modifiedMessage = message.decode().upper()
    # send upper case string back to this client
    serverSocket.sendto(modifiedMessage.encode(), clientAddress)